
vmray: support parsing flog.txt (Download Function Log) #2878

Open
devs6186 wants to merge 3 commits into mandiant:master from devs6186:feature/2452-vmray-flog-txt

Conversation

@devs6186
Contributor

closes #2452

Adds support for parsing VMRay's flog.txt format — the free "Download Function Log" available from VMRay Threat Feed → Full Report → Download Function Log. Users no longer need the full analysis ZIP archive to run capa against VMRay output.

What changed

| File | Change |
| --- | --- |
| capa/features/extractors/vmray/flog_txt.py | New parser: header validation, Process/Thread/Region block splitting, API trace line parsing, sys_ prefix stripping |
| capa/features/extractors/vmray/__init__.py | VMRayAnalysis.from_flog_txt(): builds analysis object from standalone flog.txt (no ZIP) |
| capa/features/extractors/vmray/extractor.py | VMRayExtractor.from_flog_txt(): convenience classmethod |
| capa/helpers.py | Detect flog.txt by filename + header magic in get_format_from_extension; updated unsupported-format error message to mention flog.txt |
| capa/loader.py | Route flog.txt inputs through VMRayExtractor.from_flog_txt in both get_extractor and get_file_extractors |
| tests/test_vmray_flog_txt.py | 5 unit tests: minimal parse, header rejection, sys_ stripping, VMRayAnalysis construction, VMRayExtractor construction |
| doc/usage.md | Updated CAPE row to mention VMRay flog.txt alongside VMRay ZIP |
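
As a rough illustration of the "filename + header magic" detection listed for capa/helpers.py, a check along these lines might work. This is only a sketch: the magic string `# Flog Txt Version 1` and the helper name `looks_like_flog_txt` are assumptions made for the example and are not taken from the PR.

```python
from pathlib import Path

# Assumed header magic; the PR only states that the header is validated.
FLOG_TXT_MAGIC = b"# Flog Txt Version 1"


def looks_like_flog_txt(path: Path) -> bool:
    """Return True if both the file name and the first bytes look like a flog.txt."""
    if not path.name.lower().endswith("flog.txt"):
        return False
    try:
        with path.open("rb") as f:
            head = f.read(len(FLOG_TXT_MAGIC) + 3)
    except OSError:
        # Unreadable or missing files are simply not treated as flog.txt.
        return False
    # Tolerate a UTF-8 byte-order mark in front of the magic.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return head.startswith(FLOG_TXT_MAGIC)
```

Checking the name before opening the file keeps the check cheap for inputs that are clearly something else, and requiring both conditions avoids misclassifying an unrelated text file that happens to be called flog.txt.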

Usage

capa path/to/flog.txt --backend vmray

Notes

  • Static features (imports, exports, sections, strings) are unavailable from flog.txt alone — only dynamic API call features are extracted
  • The failing tests in tests/test_vmray_features.py are pre-existing and unrelated: they require the large ZIP test fixture (tests/data/dynamic/vmray/...) which is not part of this repo

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed
  • This submission includes AI-generated code and I have provided details in the description.

@gemini-code-assist
Contributor

Summary of Changes

Hello @devs6186, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances capa's capability to analyze VMRay dynamic analysis reports by introducing direct support for flog.txt files. Previously, capa required the full VMRay analysis ZIP, but now users can leverage the free "Download Function Log" to extract dynamic API call features. This change streamlines the analysis workflow for users who only have access to the flog.txt output, making capa more versatile and accessible for VMRay report analysis.

Highlights

  • VMRay flog.txt Parsing: Implemented a new parser for VMRay's flog.txt format, allowing capa to analyze dynamic API calls directly from this file.
  • Standalone Analysis: Enabled capa to run against flog.txt files without requiring the full VMRay analysis ZIP archive, leveraging the free 'Download Function Log'.
  • Integration with capa Loader: Updated the capa loader and helper functions to automatically detect and route flog.txt inputs to the new VMRay extractor.
  • Changelog and Documentation: Added an entry to the CHANGELOG.md and updated doc/usage.md to reflect the new flog.txt support.
  • Unit Tests: Introduced new unit tests to validate the flog.txt parser, including header validation, sys_ prefix stripping, and VMRayAnalysis/VMRayExtractor construction.


Changelog
  • CHANGELOG.md
    • Added a new feature entry for VMRay flog.txt support.
  • capa/features/extractors/vmray/__init__.py
    • Imported new models (AnalysisMetadata, FileHashes) and the flog_txt module.
    • Updated SUPPORTED_FLOG_VERSIONS to include "1" for flog.txt.
    • Added a from_flog_txt class method to VMRayAnalysis to construct an analysis object from a standalone flog.txt file.
  • capa/features/extractors/vmray/extractor.py
    • Added a from_flog_txt class method to VMRayExtractor for convenience in building an extractor from a flog.txt path.
  • capa/features/extractors/vmray/flog_txt.py
    • Added a new module containing functions to parse VMRay flog.txt content, including header validation, process/thread/region block splitting, API trace line parsing, and sys_ prefix stripping.
  • capa/helpers.py
    • Implemented logic in get_format_from_extension to detect flog.txt files based on filename and header magic.
    • Updated the unsupported format error message to mention flog.txt as a supported VMRay report type.
  • capa/loader.py
    • Modified get_extractor and get_file_extractors to conditionally use VMRayExtractor.from_flog_txt when the input file is detected as flog.txt.
  • doc/usage.md
    • Updated the "Ways to consume capa output" table to explicitly mention VMRay flog.txt alongside VMRay ZIP for CAPE integration.
  • tests/test_vmray_flog_txt.py
    • Added a new test file with unit tests for flog.txt parsing, including minimal parsing, header rejection, sys_ prefix stripping, and construction of VMRayAnalysis and VMRayExtractor objects.
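
To make the "API trace line parsing" and "sys_ prefix stripping" steps concrete, here is a minimal sketch. The trace-line shape (`[timestamp] ApiName (args) returns value`) is an assumption about the flog.txt layout, and the names `parse_trace_line` and `strip_sys_prefix` are hypothetical, not the functions in flog_txt.py.

```python
import re
from typing import Optional, Tuple

# Assumed trace-line shape: "[0072.750] ApiName (args...) returns 0x1"
TRACE_LINE = re.compile(
    r"^\s*\[\s*(?P<ts>[\d.]+)\]\s+(?P<api>[A-Za-z_][\w.:]*)\s*\((?P<args>.*)\)"
    r"(?:\s+returns\s+(?P<ret>\S+))?\s*$"
)


def strip_sys_prefix(api: str) -> str:
    """Drop a leading 'sys_' so that 'sys_open' is reported as 'open'."""
    return api[len("sys_"):] if api.startswith("sys_") else api


def parse_trace_line(line: str) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (api_name, raw_args, return_value), or None for non-trace lines."""
    m = TRACE_LINE.match(line)
    if m is None:
        return None
    return strip_sys_prefix(m.group("api")), m.group("args").strip(), m.group("ret")
```

Returning None for lines that do not match lets the caller skip property lines such as `id = "1"` without special-casing them.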

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces support for parsing VMRay's flog.txt format, which is a great addition for users who don't have access to the full analysis ZIP. The implementation is generally sound and integrates well with the existing VMRay extractor. I have identified a few areas for improvement regarding robustness against malformed input and the completeness of the extracted features (specifically API arguments).

Comment on lines 43 to 47
def _parse_hex_or_decimal(s: str) -> int:
    s = s.strip().strip('"')
    if s.startswith("0x") or s.startswith("0X"):
        return int(s, 16)
    return int(s, 10)

medium

The _parse_hex_or_decimal function is not robust against empty strings. If a property in the flog.txt file is present but has no value (e.g., os_pid = ), int(s, 10) will raise a ValueError, causing the parser to crash. It would be safer to handle empty strings by returning a default value (like 0) or skipping the property.

Suggested change
- def _parse_hex_or_decimal(s: str) -> int:
-     s = s.strip().strip('"')
-     if s.startswith("0x") or s.startswith("0X"):
-         return int(s, 16)
-     return int(s, 10)
+ def _parse_hex_or_decimal(s: str) -> int:
+     s = s.strip().strip('"')
+     if not s:
+         return 0
+     if s.lower().startswith("0x"):
+         return int(s, 16)
+     return int(s, 10)

thread_blocks = [p.strip() for p in parts[1:] if p.strip()]

# First part: Process properties then Region: blocks
process_props = _parse_properties(header_and_regions.split("\nRegion:\n")[0])

medium

The split by \nRegion:\n is less flexible than the regex-based splits used for Process: and Thread: blocks. If the log file contains trailing spaces after Region:, the split will fail to isolate the process properties. For consistency and robustness, consider using a regex similar to the ones used in lines 122 and 200.

Suggested change
- process_props = _parse_properties(header_and_regions.split("\nRegion:\n")[0])
+ process_props = _parse_properties(re.split(r"\n\s*Region:\s*\n", header_and_regions)[0])
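
The difference between the two splits can be seen on a small, made-up block in which `Region:` is followed by trailing spaces. The sample text below is invented for illustration and is not taken from a real flog.txt.

```python
import re

REGION_SPLIT = re.compile(r"\n\s*Region:\s*\n")

# Invented sample: note the trailing spaces after "Region:".
block = 'id = "1"\nos_pid = "0x4"\nRegion:   \nid = "10"\nname = "stack"\n'

# The literal split finds no separator, so the "head" is the whole block.
literal_head = block.split("\nRegion:\n")[0]

# The regex split tolerates the whitespace and isolates the process properties.
regex_head = REGION_SPLIT.split(block)[0]

print(repr(regex_head))
```

With the literal separator the region properties would be parsed as if they belonged to the process, which is the failure mode the review comment describes.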

Comment on lines 167 to 168
params_in=None, # flog.txt args could be parsed later into Param list
params_out=None,

medium

Currently, API call arguments are not being parsed and are set to None. Since many capa rules rely on specific argument values (e.g., registry keys, file paths, or flags), this significantly limits the effectiveness of the extractor when using flog.txt. While the comment acknowledges this as a future improvement, implementing even a basic parser for the args_str extracted in _parse_event would greatly enhance the utility of this new feature.

@devs6186
Contributor Author

Thanks for the review! I've addressed all three suggestions in eca9286:

  1. Empty string handling in _parse_hex_or_decimal — now returns 0 for empty/missing values instead of crashing.

  2. Region block splitting: switched from split("\nRegion:\n") to re.split(r"\n\s*Region:\s*\n", ...) for whitespace robustness, consistent with the Process/Thread splits.

  3. API argument parsing: implemented _parse_args(), which extracts name=value pairs from the trace lines into Param objects. String values are modelled as void_ptr + str deref (matching the XML convention) so that String features are yielded; numeric values use unsigned_32bit so that Number features are yielded. Added a new test covering string, numeric, and no-arg calls.
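
A minimal sketch of extracting name=value pairs from an argument string is shown below. It is not the PR's `_parse_args()`: the `ParsedArg` dataclass and `parse_args` function are hypothetical stand-ins for capa's `Param` model, the argument syntax is assumed, and the sketch only tags each value as "str" or "int" rather than building void_ptr or unsigned_32bit parameters.

```python
import re
from dataclasses import dataclass
from typing import List, Union

# name = "quoted string" (backslash escapes allowed) or name = hex/decimal number
ARG_PATTERN = re.compile(
    r'(?P<name>\w+)\s*=\s*'
    r'(?:"(?P<str>(?:[^"\\]|\\.)*)"|(?P<num>0[xX][0-9a-fA-F]+|\d+))'
)


@dataclass
class ParsedArg:
    """Hypothetical container; the PR builds capa Param objects instead."""
    name: str
    kind: str  # "str" or "int"
    value: Union[str, int]


def parse_args(args_str: str) -> List[ParsedArg]:
    """Extract name=value pairs; commas inside quoted strings are not separators."""
    parsed = []
    for m in ARG_PATTERN.finditer(args_str):
        if m.group("str") is not None:
            parsed.append(ParsedArg(m.group("name"), "str", m.group("str")))
        else:
            num = m.group("num")
            base = 16 if num.lower().startswith("0x") else 10
            parsed.append(ParsedArg(m.group("name"), "int", int(num, base)))
    return parsed
```

Matching whole quoted strings in one step, instead of splitting on commas first, is what keeps paths such as `a, b.txt` or URLs containing `=` from being broken into spurious arguments.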

Please re-review

Collaborator

@williballenthin williballenthin left a comment


  1. what are the pros/cons of using flog versus the full archive? we should document that clearly somewhere
  2. aside: could we provide a helper script that, given a sample hash, automatically retrieves the flog from vmray and shows the results?
  3. we need a reasonable collection of flog files committed to testfiles and run during CI. especially with the string parsing, which tends to be brittle, we need infrastructure in place to find regressions and bugs.

@williballenthin
Collaborator

thanks @devs6186

devs6186 added a commit to devs6186/capa that referenced this pull request Feb 23, 2026
… flog.txt

Addresses reviewer feedback on mandiant#2878:

1. Document flog.txt vs full archive trade-offs in doc/usage.md with a
   comparison table (available features, how to obtain, file size).

2. Add scripts/fetch-vmray-flog.py — given a VMRay instance URL, API key,
   and sample SHA-256, downloads flog.txt via the REST API and optionally
   runs capa against it.

3. Add fixture-based regression tests (tests/fixtures/vmray/flog_txt/) with
   three representative flog.txt files:
   - windows_apis.flog.txt: Win32 APIs, string args with backslash paths,
     numeric args, multi-process
   - linux_syscalls.flog.txt: Linux sys_-prefixed calls (all stripped)
   - string_edge_cases.flog.txt: paths with spaces, UNC paths, URLs, empty

   tests/test_vmray_flog_txt.py gains 14 new feature-presence tests covering
   API, String, and Number extraction at the call scope, plus negative checks
   (double-backslash must not appear; sys_ prefix must not appear).

Fixes mandiant#2878
@devs6186
Contributor Author

  1. what are the pros/cons of using flog versus the full archive? we should document that clearly somewhere
  2. aside: could we provide a helper script that, given a sample hash, automatically retrieves the flog from vmray and shows the results?
  3. we need a reasonable collection of flog files committed to testfiles and run during CI. especially with the string parsing, which tends to be brittle, we need infrastructure in place to find regressions and bugs.

hey @williballenthin, I have addressed everything in the latest commit:

for the docs i added a comparison section in usage.md with a table laying out exactly what you get from
flog.txt vs the full archive. figured that's the clearest place for it since people land there when figuring
out how to use capa.

for the fetch script - scripts/fetch-vmray-flog.py takes a hash + api key, looks up the sample, grabs the
most recent analysis and downloads the function log. added a --run-capa flag too so you can go from hash to
capa output in one shot. i don't have a vmray instance to test the exact endpoints against so i went with
what the REST api docs suggest and added a fallback, but if the endpoint paths are off for your setup they
should be easy to adjust.

for the fixtures I added three flog.txt files under tests/fixtures/vmray/flog_txt/ (in the main repo so CI
always has them without needing testfiles): a windows one with backslash paths and multi-process, a linux one
with 22 sys_ calls, and one that specifically targets the brittle stuff — paths with spaces, UNC paths,
URLs. the test count went from 6 to 20, including negative checks (double-backslash form must not appear in
features, sys_-prefixed names must not appear).
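
A negative check of the "double-backslash must not appear" kind could be written roughly as follows. This assumes flog.txt stores each backslash in a string value as an escaped pair; the helper `unescape_flog_string` and the test name are hypothetical and are not the PR's code.

```python
def unescape_flog_string(raw: str) -> str:
    """Collapse each escaped backslash pair into a single backslash (assumed encoding)."""
    return raw.replace("\\\\", "\\")


def test_windows_path_has_single_backslashes():
    # As the value is assumed to appear in flog.txt: every backslash doubled.
    raw = r"C:\\Program Files\\app\\a.exe"
    value = unescape_flog_string(raw)
    assert value == r"C:\Program Files\app\a.exe"
    # Negative check: the escaped form must not leak into extracted features.
    assert "\\\\" not in value


test_windows_path_has_single_backslashes()
```

UNC paths legitimately begin with two backslashes after unescaping, so a check like this only makes sense for drive-letter paths; a UNC fixture needs an exact-match assertion instead.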

on adding real samples to testfiles, I am totally happy to do that as a follow-up, just wanted to point out that it needs a separate PR to the submodule. if you have particular samples in mind let me know and i'll set it up.

Adds a parser for the VMRay flog.txt format (the free "Download Function
Log" available from VMRay Threat Feed -> Full Report). Users no longer
need the full ZIP archive to run capa against VMRay output.

- capa/features/extractors/vmray/flog_txt.py: new parser for flog.txt
  header validation, Process/Thread/Region block splitting, API trace
  line parsing, sys_ prefix stripping
- VMRayAnalysis.from_flog_txt() and VMRayExtractor.from_flog_txt() for
  constructing the extractor from a standalone flog.txt
- helpers.py: detect flog.txt by filename + header magic; update
  unsupported-format error message to mention flog.txt
- loader.py: route flog.txt inputs through VMRayExtractor.from_flog_txt
- tests/test_vmray_flog_txt.py: 5 unit tests covering parse, header
  rejection, sys_ stripping, analysis and extractor construction

Fixes mandiant#2452
- Handle empty strings in _parse_hex_or_decimal (return 0 instead of crash)
- Use regex for Region: block splitting (consistent with Process:/Thread:)
- Parse API call arguments into Param objects so String/Number features
  are extracted (string args use void_ptr+str deref to match XML convention)
- Use FunctionCall.model_validate instead of __init__ to work around
  Pydantic alias "in" clashing with Python keyword
- Add test_parse_flog_txt_args_parsed covering string, numeric, and
  no-arg API calls
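
The keyword clash mentioned above comes from Python itself: `in` is a reserved word, so it cannot be written as a keyword argument, while a mapping key has no such restriction. The sketch below shows that behaviour with a hypothetical stand-in function `make_call`; it uses only the standard library and does not call Pydantic's `model_validate`.

```python
import keyword


def make_call(**fields):
    """Hypothetical stand-in for a model constructor that accepts aliased field names."""
    return dict(fields)


# "in" is reserved, so make_call(in=[...]) cannot even be parsed.
assert keyword.iskeyword("in")

try:
    compile("make_call(in=[])", "<demo>", "eval")
    direct_call_compiles = True
except SyntaxError:
    direct_call_compiles = False

# Passing the field through a mapping works; validating from a dict relies on the same idea.
call = make_call(**{"name": "CreateFileW", "in": []})
```

Building the object from a dict keeps the aliased field name as plain data, so the source never has to spell `in` as an identifier.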
@devs6186 devs6186 force-pushed the feature/2452-vmray-flog-txt branch from b58fbeb to 548d814 Compare February 24, 2026 04:01
Development

Successfully merging this pull request may close these issues.

Support Parsing VMRay Log flog.txt for capa

2 participants